Exploiting Multiple Sources of Evidence of Document Relatedness in Hybrid Search Engines: A Unifying Model and Design Proposal

Author

  • Jonathan Furner
Abstract

Previous work on search-engine design has indicated that information-seekers may benefit from being given the opportunity to exploit multiple sources of evidence of document relatedness. Few existing systems, however, give users more than minimal control over the selections that may be made among methods of exploitation. By applying the methods of “document network analysis” (DNA), a unifying, graph-theoretic model of content-, collaboration-, and context-based systems (CCC) may be developed in which the nature of the similarities between types of document relatedness and document ranking is clarified. The usefulness of the approach to system design suggested by this model may be tested by constructing and evaluating a prototype system (UCXtra) that allows searchers to maintain control over the multiple ways in which document collections may be ranked and re-ranked.

Introduction and Overview

[I]f one has available several different representations of a single information problem, then it makes sense to use all of them, in combination, to improve retrieval performance, rather than try to identify and use only the best one. (Belkin, Kantor, Fox, & Shaw, 1995, p. 446)

Finding a way to provide powerful search without overwhelming novice users is a current challenge. Existing interfaces often hide important aspects of the search (by poor design or to protect proprietary relevance-ranking schemes), or make query specification so difficult and confusing that they discourage use. Evidence from empirical studies shows that users perform better and have higher subjective satisfaction when they can view and control the search ... (Shneiderman, 1998, p. 515)

Much about the design of search engines that provide access to networks of textual documents (collections of web pages, for instance, or scholarly papers, or library books) is well understood (see, for example, Belew, 2000). Many techniques for identifying the particular documents that are relevant to the needs of a given information-seeker are known to work reasonably effectively and efficiently, and are implemented in commercial search systems that are successful and popular. Such techniques include those based variously on content, collaboration, and context (to invoke a taxonomy that will be developed later in this paper). At the current time, however, it seems that no single retrieval system exists that offers the user the opportunity to take advantage of more than a few of these techniques in combination.

To take as an example a system that is often cited by information professionals as their web search engine of choice, Google employs just one method of producing a ranked list of documents in response to a searcher’s query. Should the user wish to create a different ranking of documents based, say, on co-citation (“Pages that link to this one also link to ...”) rather than on similarity of content, that option is not available. The point here is not to criticize Google or any of the other systems that we might have used as examples. It is not clear, in any case, how useful or attractive the implementation of an additional facility for ranking by co-citation would be to the users of a search engine whose document database is a collection of web pages.

The claim, rather, is that, in general, the searcher’s lot would be improved if any decision as to the choice of retrieval algorithm that should be used in a given circumstance were left to the searcher, and not imposed on the searcher by the system; and not only this, but that to have such control over the retrieval process is most satisfying and most easily exploited when it is offered within the environment of a single, integrated system equipped with a single, homogeneous interface to multiple, diverse retrieval mechanisms, rather than with any requirement continually to switch back and forth between systems.

We may distinguish three types of approach that have been, or may conceivably be, taken in tackling the problem of search-technique selection. The point at which approaches of the three types differ is in the assumption they make as to the locus of expertise. One type of approach, for instance, might be termed the “expert designer” solution, since the method by which multiple sources of evidence of document relevance are considered in combination, in order to produce a single ranked list of documents, is controlled by the system designer---who may well be formulating her algorithms for data fusion on an ad hoc, trial-and-error basis. Metasearch systems that attempt to solve both the collection-selection problem (of identifying the most promising databases in which to carry out searches) and the collection-fusion problem (of merging the results received from multiple databases) commonly proceed in this way (Aslam & Montague, 2001).

Another kind of approach may be termed the “expert system” solution. The thinking here is that, if the system is able somehow to infer a characterization of the individual searcher, such a characterization may then be matched against a repertoire of search procedures so that the procedure appropriate to the given context may be selected and applied (Mladenic, 1999). Notwithstanding the progress that has been made in the design of intelligent agents, however, we yet have little knowledge of which combination of procedures is most effective under which circumstances, for which user groups, given which kinds of information need, in which contexts.

Rather than attempting to construct a system that improves on the “one-size-fits-all” philosophy of its predecessors by its ability automatically to derive, from whatever minimal knowledge it has of the user, a determination of the particular retrieval mechanism that is most appropriate in a given situation, it is suggested that we should instead focus on the design of systems which take direct advantage of the individual searcher’s conscious ability to make reasoned decisions as to the type of search procedure that should be followed by the system in any given context or at any particular stage in the search process. This latter approach might be called the “expert user” solution, although doing so would overstate the level of expertise of the typical searcher. The “expertise” that is leveraged is simply the searcher’s perception (what we might otherwise count as knowledge) of distinctions between search procedures that work well and ones that do not.

In the study from which the first quotation at the head of this section was taken, Belkin et al. (1995) were concerned primarily to evaluate the effectiveness of merging the results retrieved by multiple different representations of the same information need.

It seems reasonable to hypothesize that their conclusion may be generalizable---that if the searcher is able to express their need in multiple ways, or (which is the same thing?) to identify multiple ways of satisfying that need, then for maximum effectiveness the system should be designed to be capable of taking advantage of that richness of interaction. Yet we might also argue that we should not necessarily focus on automated methods of combining or fusing multiple sources of evidence of relevance, since it might be the case that searchers are good at identifying the sources that are most profitable in given contexts.

The “empirical studies” that Shneiderman cites in the second quotation at the head of this section are those of relevance-feedback methods undertaken by Koenemann and Belkin (1996), who concur that “a central question for the design of interactive systems in general is the amount of knowledge a user is required or expected to have about the functioning of the system and the level of control a user can exert” (p. 206). In their experiments, Koenemann and Belkin compared the effectiveness and popularity of a number of feedback techniques, each conceived as occupying a different point on a spectrum ranging from “completely hidden” to those under the users’ “complete control” (p. 206), and found that their subjects “routinely expressed their desire to ‘see and control’” what the mechanisms did (p. 212). Just as systems that employ relevance-feedback techniques exploit the searchers’ personal perceptions of the relevance of individual documents, systems designed along the lines proposed in this paper analogously exploit searchers’ personal perceptions of the usefulness of particular retrieval algorithms.

My suggestion, then, is that it would appear to make sense to examine the following dual hypothesis: not only (a) that we may, in certain circumstances, improve the effectiveness of information retrieval by bringing multiple retrieval techniques to bear on single instances of information need; but also (b) that we may, in certain circumstances, further improve the effectiveness of a system implementing multiple techniques by offering the user the opportunity to make her own dynamic, interactive selection from amongst the available techniques.

In this paper, I present some of the theoretical foundations that may serve as a conceptual basis for empirical tests of the foregoing claims. I intend to do the following:

1. draw the rough boundaries of “document network analysis” (DNA) as a newly autonomous territory within the broader field of information studies;

2. use the methods of DNA to establish a framework for the description and explanation of retrieval algorithms, involving a presentation of the most basic mathematical techniques for the representation and analysis of document networks, and a specification of a unifying “content--collaboration--context” (CCC) model of retrieval systems;

3. elucidate the concepts of relatedness and rank, distinguishing (a) among the multiple ways in which documents may be said to be related, and (b) among the correspondingly multiple ways in which collections of documents may be ranked;

4. outline the hypotheses (a) that information-seekers may benefit from being given the opportunity to exploit multiple sources of evidence of document relatedness, and (b) that the most difficult problem involved in any attempt to produce this benefit may lie in the design of the interface rather than in the specification of any method used to fuse or combine those multiple sources of evidence; and

5. introduce a prototype system, called UCXtra, that will be used to test the claims made here.

Document Network Analysis

Within the broad discipline of information studies (Cornelius, 1996), it is becoming increasingly useful to delineate and refer to a particular confluence of research questions by applying a new, encompassing label: document network analysis. The subfield denoted by this term (which abbreviates, happily or otherwise, to “DNA”) may be conceived as originating at the intersection of four existing specializations: (a) information retrieval (IR) systems analysis and design (Salton & McGill, 1983; Sparck Jones & Willett, 1997); (b) bibliographic (library) classification, a.k.a. knowledge organization (Ranganathan, 1937; Svenonius, 2000); (c) hypertext (Nielsen, 1993; Ashman & Simpson, 1999); and (d) bibliometrics (Egghe & Rousseau, 1990; Borgman & Furner, in press). A snapshot of the subjects of concern to members of each of these academic communities may be obtained by examination of the most recent proceedings of the main conferences in the field: those of ACM SIGIR, ISKO, ACM SIGWEB, and ISSI, respectively.

DNA: Assumptions, Goals, and Methods

A central, common assumption made by scholars working in each of these areas is that the primary objects of study---documents in which human thought is recorded, and through which representations of such thought are communicated---are objects that are created, organized, transferred, and used on a mutually dependent basis. No individual document exists independently of all others; every document is related to every other, in various ways, and to extents that vary both with time and with the identity of the observer. The universe of documents may be viewed as a vast, dynamic network in which individual documents are situated with regard to one another through specification of the relationships that are perceived to exist among them at given times. In traditional information retrieval, for example, such relationships are typically identified through analysis of the terms that different documents have in common (Fairthorne, 1956); in the theory of bibliographic classification, each document class is populated by member documents that stand both in a certain relationship to one another, and in a differing relationship to the members of every other class (Classification Research Group, 1955); the links connecting the nodes of a hypertext are explicit representations of inter-document relationships (Conklin, 1987); and a basic technique in bibliometrics involves counting the citations by which document authors indicate connections between their own work and that of others (Price, 1965; Garfield, 1979).

Effective information retrieval is the primary goal of DNA. The process of information retrieval may be defined in broad, general, user-oriented terms as an active, cognitive, human one by which problems are solved, sense is made, meanings are constructed, needs are satisfied, gaps are bridged, or anomalies are resolved (Kuhlthau, 1993).

In narrower, system-oriented terms, information retrieval is the process of identifying, from among the documents making up a large collection, those that are “relevant” (Vickery, 1959) or of “utility” (Cooper, 1973)---i.e., those that are wanted or preferred by the human searcher, or perceived by her to be related to her interests in a given manner at a given time (Schamber, 1994), by (for example) being the cause of change in her cognitive state (Harter, 1992). The task of any system whose function is to support this process is to name, set, and match: (a) to describe each document by naming each member of the set of classes to which the document belongs (i.e., by identifying the terms that serve as the labels of those classes); (b) to position or set each document in relation to every other by analysis of shared class memberships; and (c) to rank documents on the basis of the degree to which their representations are deemed to match an analogously structured representation of the information need of a searcher.

Each step of this process corresponds roughly to the application of a distinctive body of methods central to DNA. The first phase, known variously as that of subject, content, text, or document analysis (Lancaster, 1998), in which documents are indexed or classified on the basis of their content or other characteristics, is the domain of classification theory. In the second phase, document networks are modeled as hypertexts, making use of concepts drawn from the branch of mathematics known as graph theory (Harary, Norman, & Cartwright, 1965), which may be used to model any structure consisting of a set of objects (vertices, or nodes) and a set of relationships (edges, or links) among those objects (Dipert, 1997). In the third phase, the previously-identified features of documents and of document pairs are counted and analyzed using both algebraic and statistical methods; simply on account of the book-like nature of the objects under consideration, methods of the latter type are often conveniently grouped under the heading of bibliometrics or informetrics (Egghe & Rousseau, 1990).

In summary, the identity and unity of DNA are derivative of the following shared viewpoints of its practitioners: (a) an epistemological stance that accepts the reality of causal connection between structure and process, and a corresponding conviction that theory about the latter may be tested by examination of the former; (b) an ontological perspective that privileges the relational properties of objects; and (c) a methodological preference for mathematical methods of structural analysis.

DNA: Research Questions

The research questions that drive the study of DNA may be classified by locating each of them on the appropriate intersection of three axes.

1. Applied vs. methodological. Applied questions are about the nature of, and processes associated with, real-life document networks; methodological questions are about the approaches that may be taken in attempts to answer applied questions.

2. Descriptive (explanatory) vs. prescriptive (normative, evaluative). Descriptive questions are about how things are; prescriptive questions are about how we think things should be.

3. General vs. specific. General questions are about networks (or methods) of certain kinds, types, or classes; specific questions are about given, individual networks in particular.

Some applied questions of a descriptive and general nature, for example, are:

• What are the structural characteristics of document networks? What are the ways in which documents may be said to be related to one another? In what kinds of order may documents be placed? How may the relationships existing among the documents in a network be classified?

• What are the processes through which document networks are created, maintained, and exploited?

• In what ways are the documents in networks used?

The results of applying the methods of DNA to general, descriptive, applied questions of this type will typically be a typology of some kind. Alternatively, descriptive, applied questions may be specific rather than general:

• What are the structural characteristics of, and factors influencing the development of, a given document network? What is the structure, shape, or size of network X? What are the possible rankings of the documents in network X?

The results of applying the methods of DNA here may be some kind of representation or model of a particular structure or ranking. Some applied questions of a prescriptive nature, on the other hand, are:

• How should systems providing access to, or enabling the exploitation of, document networks be designed and managed so that the quality of interaction between humans and documents is improved?

• What is the optimal design of a given document-network access system?

Methodological questions take the following forms:

• What kinds of method are appropriate for use in attempts to answer applied questions? What kinds of document analysis may be carried out (a) in order to augment our understanding of the ways in which documents are used (descriptive), or (b) in order to improve the quality of document usage (prescriptive)? What methods can we use to identify and represent (a) types of inter-document relationship or document ranking (general), or (b) individual relationships or rankings (specific)?

• What methods are best? And what does “best” mean in this context?

• What may we learn from cognate disciplines in which networks of other kinds are studied?

In the next two sections of this paper, it is demonstrated how a small set of generic methods can be used both to produce representations of document networks and to develop a typology of document rankings. In a subsequent section, this typology is exploited in the proposal of a novel answer to the general, prescriptive, applied question of how we may seek, through the making of appropriate design choices, to optimize the quality of human--document interaction.

Graph-Theoretic Representations of Document Networks: The CCC Model

The use of mathematical techniques of graph theory and matrix algebra in the modeling of document networks has been well documented over several decades, and often mirrors the prior application of such methods in the related fields of numerical taxonomy (Sneath & Sokal, 1973), chemical structure analysis (Willett, 1987; Downs & Willett, 1996; Paris, 1997; Willett, 2000), social network analysis (Wasserman & Faust, 1994; Haythornthwaite, 1996), and cognitive psychology (Schvaneveldt, Dearholt, & Durso, 1988).

Graph theory has been explicitly applied in the analysis of hypertext structures (Tompa, 1989; Parunak, 1991; Botafogo, Rivlin, & Shneiderman, 1992; Furner, Ellis, & Willett, 1996; Chen, 1998); citation structures (Garner, 1967; Zunde, 1971; Cummings & Fox, 1973; Small & Griffith, 1974; Shepherd, Watters, & Cai, 1990; Small, 1999); textual identity networks or “bibliographic families” (Leazer & Furner, 1999); and the World Wide Web (Kleinberg, Kumar, Raghavan, Rajagopalan, & Tomkins, 1999; Barabási, Albert, & Jeong, 2000; Broder et al., 2000); as well as in models of semi-structured data (Abiteboul, Buneman, & Suciu, 2000); metadata (Lassila, 1998; Biezunski & Newcomb, 2001); and semantic (conceptual) networks (Priss, 1998). Similarly, matrix algebra has long been a core component of methods of automatic classification (clustering) of documents (Van Rijsbergen, 1979; Willett, 1988; Shaw, 1991) and of citation analysis (Yagi, 1965; Pinski & Narin, 1976; Noma, 1984; Doreian, 1994). More recently, it has been put to work in the identification of “authoritative” websites (Brin & Page, 1998; Kleinberg, 1999). At an even greater level of generality, basic vector-processing techniques involving the measurement of the degree of similarity between single vectors lie at the heart of content-based (Salton & McGill, 1983; Salton & Buckley, 1990), collaboration-based (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994; Soboroff & Nicholas, 2000), and context-based (Croft & Thompson, 1987; Salton & Buckley, 1991; Melucci, 1999) retrieval systems. All these methods are central to the emerging field of “cybermetrics” or “webometrics,” in which quantitative techniques are used to analyze processes related to electronic documents and digital libraries (Almind & Ingwersen, 1997; Björneborn & Ingwersen, 2001; Cronin, 2001).

Here we intend simply to present the most fundamental applications of these techniques in a manner that highlights what we perceive to be the most important points of commonality between retrieval-system types that are often considered discretely, by designers and by users both. The rationale for spending more than a brief time in developing this simple framework is provided by the hope that, after its extended presentation, some will find it useful as a lens through which the diverse contributions of members of the historically overlapping but separate communities of designers of content-based IR systems, collaboration-based filtering systems, and context-based hypertext systems may be viewed, interpreted, and integrated. Such an outcome accords, at a level of specificity, with a more-general vision inspired by the identification of document network analysis as a coherent grouping of research problems and methods.

We can begin by referring to Figure 1, which depicts a diagrammatic representation of a small document network, used as an example for application of the operations to be described below. The example network consists of the following six sets of elements (three sets of objects, and a further three sets of object-pairs):

1. a set of five documents (rectangles in the diagram), each assigned a label from A–E;

2. a set of four term-types (ovals), each assigned a label from 1–4;

3. a set of six human information-seekers or judges (circles), each assigned a label from 1–6;

4. a set of ten document--term pairs (dotted lines);

5. a set of sixteen judge--document pairs (dashed lines); and

6. a set of eight “links” or ordered document--document pairs (solid lines).

We can say that N = 5 (where N is the number of documents), M = 4 (where M is the number of term-types), R = 6 (where R is the number of judges), and p = 8 (where p is the number of links). Each of the three sets of pairs may be represented in matrix form, and the three resulting matrices are presented in Figure 2. Each matrix is populated with binary values: the value of the cell at the intersection of row i and column j is 1 if object i and object j form a pair {i, j}, or 0 if not. (In the third case, the value of the cell at that intersection is 1 if object i and object j form a directed or ordered pair <i, j>, or 0 if not.) In this way, we may derive a document--term content matrix (M1), a judge--document approval matrix (M2), and a document--document adjacency matrix (M3). Every row and column in each matrix may be considered on its own as a vector or n-tuple of elements, where n is the number of elements; such a vector may be used as a representation of object i (if a row vector) or object j (if a column vector), and may also be considered as a set of attribute--value pairs, where the attributes are the labels of the columns (if a row vector) or rows (if a column vector).

The binary data in these matrices indicate the existence or non-existence of a pairing: in matrix M1, the presence or absence of a particular document in the set to which a particular term-type has been assigned; in M2, the approval or non-approval of a given document by a given judge; and in M3, the adjacency or non-adjacency of one document to another. If we wished to record a richer representation of the structure of the network, we could just as easily use non-binary values to indicate the weight of a term within a document description, the extent of a judge’s approval, or the strength of a citation or hypertext link.

We can go on to distinguish between three basic types of retrieval system, each of which uses a different matrix as base data for further manipulation:

1. content-based systems (“traditional” IR systems), whose base data is the content of a document--term content matrix (M1; Fairthorne, 1956);

2. collaboration-based systems (also known as collaborative filtering or recommender systems), based on a judge--document approval matrix (M2; Resnick et al., 1994); and

3. context-based systems (also known as link-analytic or hypertext IR systems), based on a document--document adjacency matrix (M3; Croft & Thompson, 1987).

For convenience, we may refer to the unifying graph-theoretic model that encompasses systems of each of these three types as the CCC (content--collaboration--context) model. In all of these systems, the base data is manipulated for the same purpose (that is, in order to produce document rankings), and in the same ways (that is, using the same few, simple matrix operations). This purpose, and these operations, are described in the next section. Much of the explanation provided in this and subsequent sections is deliberately repetitive in order to emphasize the similarities between techniques applied in different contexts.
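
To make these data structures concrete, the following sketch (ours, not the paper's) encodes three NumPy arrays of the shapes just described. Figures 1 and 2 are not reproduced in this copy, so the individual cell values are invented; they are chosen, however, so that every row and column sum matches the totals quoted in the worked example below. The simple sum-based rankings computed later therefore agree with the text, while the individual link placements in M3, and hence the eigenvector-based rankings, do not.

```python
import numpy as np

# Hypothetical stand-ins for the matrices of Figure 2. Cell values are
# invented; only the marginal row/column sums match the totals quoted
# in the paper's worked example.

# M1: document--term content matrix (N x M). M1[i, j] = 1 iff term-type
# j+1 has been assigned to document i (rows are documents A-E).
M1 = np.array([[1, 1, 1, 0],    # A
               [0, 1, 0, 1],    # B
               [1, 0, 1, 0],    # C
               [0, 1, 0, 1],    # D
               [0, 0, 1, 0]])   # E

# M2: judge--document approval matrix (R x N). M2[i, j] = 1 iff judge
# i+1 approves document j (columns are documents A-E).
M2 = np.array([[1, 1, 0, 1, 0],    # judge 1
               [1, 0, 0, 0, 1],    # judge 2
               [1, 1, 0, 0, 1],    # judge 3
               [1, 0, 1, 0, 1],    # judge 4
               [0, 0, 0, 1, 0],    # judge 5
               [1, 1, 1, 1, 0]])   # judge 6

# M3: document--document adjacency matrix (N x N). M3[i, j] = 1 iff
# document i links to (cites) document j.
M3 = np.array([[0, 1, 0, 0, 0],    # A -> B
               [1, 0, 0, 0, 0],    # B -> A
               [1, 0, 0, 0, 1],    # C -> A, E
               [1, 1, 1, 0, 0],    # D -> A, B, C
               [0, 0, 0, 1, 0]])   # E -> D
```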

Retrieval as Ranking

Relatedness, Ranking, and Relevance

A central tenet of document network analysis (see earlier section) is that every document is related to every other. Indeed, its practitioners take a stance that explicitly emphasizes the ontological significance of the relationships that exist among objects or entities of all kinds, such as documents, terms, and people. Every object stands in relationships of numerous kinds with every other. But what varies is not only the kind, but the strength of relationships---i.e., the degree of relatedness, the extent to which a particular pair of objects are related in a particular way. Objects may be ranked in order of their relatedness to certain others.

For example: One kind of relationship that structures networks of documents and people is that of approval; a given judge may approve of a given document to a certain extent at a given time. Documents may thus be ranked in order of the extent to which they are approved by that judge. Of course, people may approve or disapprove of documents for many different reasons, or on the basis of many different criteria. One such criterion is the perceived informativeness of the document---i.e., the extent to which it is perceived to cause (or to be likely to cause) change of a desirable kind in the judge’s cognitive state (Harter, 1992). We might say that documents that are approved for their perceived informativeness have the property of being subjectively relevant, or that the relationship between person and document is one of relevance. In this sense, relevance is construed as a kind of relationship between person and document, and the degree to which a given document is related to a given person in this way is assumed to be dependent on that person’s subjective estimation of the probability that that document is informative. But informativeness is not the only criterion on which people decide whether they approve of a document or not.

The behavior of a person in judging relevance, or in evaluating a document in readiness for recording an expression of the degree to which they approve it, may thus be characterized as follows: (a) as individual and subjective---in that different people, even when placed in otherwise similar situations and taking into account similar factors, will make different decisions; (b) as complex and multidimensional---in that single decisions are often based on multiple factors, and multiple kinds of factors, simultaneously; and (c) as dynamic and situational---in that, on different occasions or when placed in different situations, people take account of different factors and make different decisions (Schamber, Eisenberg, & Nilan, 1990).

However we specify the general function of a document retrieval system---in terms, perhaps, of its provision of support for problem-solving, sense-making, meaning-construction, need-satisfaction, gap-bridging, or anomaly-resolving---the specific function of the particular component of the system that is typically known as the “retrieval mechanism,” and that makes use of a procedure of a type known as the “ranking algorithm,” is to select an order in which to present the documents in the collection to the searcher. In other words, the function of the retrieval mechanism is to map the relatively complex network structure of the document collection to some ranked list that is simply linear rather than networked in structure. This mapping function is understood to be necessary given the assumption that the searcher, limited by the linear nature of time itself, is able to view and process a set of multiple documents only in serial fashion, one by one.

The order chosen by the ranking algorithm acts as the system’s best guess as to the order that would be selected by the searcher if they were somehow to have prior knowledge of the degree to which each document is actually relevant to them in the current context.

Rankings and Queries

A query may be understood as a set of attribute--value pairs (i.e., a vector) identified by a searcher for use as self-representation, and communicated as such to the system. We may distinguish immediately between (a) a representation that has existed for some period of time prior to its selection as a query by the searcher, and (b) one that is formulated by the searcher herself, at the time of its selection. We may call queries of the first kind pre-defined (or perhaps “queries-by-example”), and queries of the second kind user-defined. It should be noted that all queries, of either kind, are user-selected; the focus on pre-definition and user-definition partly captures a related distinction, often made in the literature of human--computer interaction (HCI), between selection-by-recognition and selection-by-recall.

We may distinguish between two kinds of ranking on the basis of the extent to which the ranking is personalized to the individual searcher. Rankings of the first kind, which we will from now on call Type I rankings, are not personalized in this sense: To produce such a ranking, the system does not require the searcher to supply a query of any kind (whether pre-defined or user-defined) to which the system should respond; at a given time t, the same Type I ranking would be produced whoever the searcher were and whatever their particular needs. Conversely, rankings of Type II are personalized to the extent that their production rests on the searcher’s ability and willingness to communicate a representation (pre-defined or user-defined) of their information need to the system, or on the system’s ability automatically to derive such a representation from available data about the individual searcher’s characteristics or prior activities.

Type I (Non-Personalized) Rankings

In this section, we shall describe nine methods of producing a Type I ranking, each method utilizing a different source of base data. Two methods begin with the data given in a document--term content matrix (M1); two with the judge--document approval matrix (M2); and five with the document--document adjacency matrix (M3). The core computation in three of the methods is simple row summation; in three, it is column summation; the remaining three involve more-complex matrix operations such as the derivation of eigenvectors. These distinctions are summarized in Table 1.

Type I rankings based on content data

Any content-based system may immediately produce Type I rankings of two sorts without requiring as input any characterization of an individual searcher. One such ranking is of documents, and is by document length (RI:1 in Table 1)---where the “length” of a given document is computed as the sum of the values in that document’s row-vector in the content matrix (perhaps normalized by the total sum of all row-sums). In the example, the normalized row-sums of the content matrix are given by the vector <0.30, 0.20, 0.20, 0.20, 0.10>; the resulting ranking is thus specified by <A, B | C | D, E>.
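
Both content-based Type I rankings are plain normalized sums, as a short sketch shows. rank_spec() is a helper of our own devising that renders a score vector in the ranking notation used in this paper, with “|” joining tied items and commas separating rank levels.

```python
def rank_spec(labels, scores, ndigits=2):
    """Render scores in the paper's ranking notation, e.g. '<A, B | C | D, E>';
    items whose scores agree to `ndigits` places are treated as tied."""
    ranked = sorted(zip(labels, scores), key=lambda pair: -pair[1])
    levels, current, prev = [], [], None
    for label, score in ranked:
        key = round(score, ndigits)
        if prev is not None and key != prev:   # strictly lower rank level
            levels.append(current)
            current = []
        current.append(label)
        prev = key
    levels.append(current)
    return "<" + ", ".join(" | ".join(level) for level in levels) + ">"

docs = list("ABCDE")
terms = ["1", "2", "3", "4"]

length = M1.sum(axis=1) / M1.sum()   # RI:1: normalized row-sums of M1
freq = M1.sum(axis=0) / M1.sum()     # term frequency: column-sums of M1

print(rank_spec(docs, length))       # <A, B | C | D, E>
print(rank_spec(terms, freq))        # <2 | 3, 1 | 4>
```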

The other such ranking is of terms, and is by term frequency---where the “frequency” of a given term is computed as the sum of the values in that term’s column-vector in the content matrix (again, perhaps normalized by the total sum of all column-sums). In the example, the normalized column-sums of the content matrix are given by the vector <0.20, 0.30, 0.30, 0.20>; the resulting ranking is specified by <2 | 3, 1 | 4>.

Of course, it is difficult to imagine a circumstance in which any ranking based solely on document length would be effective as an approximation to the preference ordering of the individual searcher; but both document lengths and the within-collection frequencies of terms are commonly employed as normalizing elements in formulae (such as that known as TF.IDF) of the kind that are used to weight the values making up document vectors in non-binary versions of document--term matrices, in preparation for the derivation of Type II rankings (Salton & Buckley, 1988).

Type I rankings based on approval data

In an analogous manner to that used to derive Type I rankings from content data, there are two sorts of such rankings that may be derived from approval data. On the one hand, documents may be ranked by their popularity (RI:2 in Table 1)---where the “popularity” of a given document is computed as the sum of the values in that document’s column-vector in the approval matrix (perhaps normalized by the total sum of all column-sums). The greater the number of judges that have approved the document, the higher that document’s popularity score. In the example, the normalized column-sums of the approval matrix are given by the vector <0.31, 0.19, 0.13, 0.19, 0.19>; the resulting ranking is specified by <A, B | D | E, C>.

On the other hand, judges may be ranked by their indiscrimination---where the “indiscrimination” of a given judge is computed as the sum of the values in that judge’s row-vector in the approval matrix (again, perhaps normalized by the total sum of all row-sums). The greater the number of documents that have been approved by a judge, the higher that judge’s indiscrimination score. In the example, the normalized row-sums of the approval matrix are given by the vector <0.19, 0.13, 0.19, 0.19, 0.06, 0.25>; the resulting ranking is specified by <6, 1 | 3 | 4, 2, 5>.

A ranking of documents based on popularity scores---the equivalent of a list of “bestsellers” or a “hit parade”---may be of inherent interest to the individual searcher. Additionally, such a ranking may be used to modify various Type II rankings, as may a ranking of judges by indiscrimination.
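
Popularity and indiscrimination follow exactly the same summation pattern; since the marginal sums of the hypothetical M2 match the worked example, the printed rankings agree with those just quoted.

```python
judges = ["1", "2", "3", "4", "5", "6"]

popularity = M2.sum(axis=0) / M2.sum()         # RI:2: column-sums of M2
indiscrimination = M2.sum(axis=1) / M2.sum()   # row-sums of M2

print(rank_spec(docs, popularity))             # <A, B | D | E, C>
print(rank_spec(judges, indiscrimination))     # <6, 1 | 3 | 4, 2, 5>
```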

Type I rankings based on adjacency data

Two sorts of Type I rankings may similarly be derived from an adjacency matrix. Documents may be ranked by their citivity (RI:3 in Table 1)---where the “citivity” of a given document is computed as the sum of values in that document’s row-vector in the adjacency matrix (perhaps normalized by the total sum of all row-sums). The greater the number of target (destination or cited) documents that are linked to by a given source (origin or citing) document, the higher that source document’s citivity score. The row-sum associated with a document in an adjacency matrix is sometimes known as that document’s out-degree. If the source document is a web page, its citivity score is given by the number of outgoing links on that page; if the source document is a scholarly paper, its citivity score is given by the number of references in its bibliography. In the example, the normalized row-sums of the adjacency matrix are given by the vector <0.13, 0.13, 0.25, 0.38, 0.13>; the resulting ranking is specified by <D, C, A | B | E>.

Documents may also be ranked by their citedness (RI:4)---where the “citedness” of a given document is computed as the sum of values in that document’s column-vector in the adjacency matrix (again, perhaps normalized by the total sum of all column-sums). The greater the number of source documents that link to a given target document, the higher that target document’s citedness score. The column-sum associated with a document in an adjacency matrix is sometimes known as that document’s in-degree. If the target document is a web page, its citedness score is given by the number of incoming links to that page; if the target document is a scholarly paper, its citedness score is given by the number of other documents that include a reference to that target document in their bibliography. In the example, the normalized column-sums of the adjacency matrix are given by the vector <0.38, 0.25, 0.13, 0.13, 0.13>; the resulting ranking is specified by <A, B, C | D | E>.

Document citivity, like document length, is seldom likely to be chosen by a searcher as the sole criterion on which to base a ranking of the documents in a collection. However, as well as playing a useful role in searches for review articles, it may be employed as a normalizing element in Type II rankings of all kinds. Document citedness, in contrast, is often viewed as a surrogate for document “quality,” and may well be of inherent, standalone interest to the searcher in a manner similar to that in which document popularity is commonly perceived to be an interesting characteristic.

At least three other Type I rankings of a rather different kind may be derived from adjacency data. The rationale for producing these rankings involves extensions of the idea that documents may usefully be ranked by citedness (i.e., by the extent to which they are cited), to (in the first case) the idea that documents may even more usefully be ranked by the extent to which they are cited by documents that are themselves highly cited, and (in the second and third cases, respectively) the idea that documents may instead be ranked (i) by the extent to which they are cited by documents that cite other highly-cited documents, or (ii) by the extent to which they cite documents that are cited by other highly-citive documents.

In the Google system developed by Brin and Page at Stanford (to take the most well-known example of a context-based system that implements a Type I ranking of this kind), web pages are initially ranked in order of their “PageRank” (RI:5 in Table 1; Brin & Page, 1998); in the Clever system designed by Kleinberg and colleagues at IBM Almaden, web pages are initially ranked both in order of their “authority” (RI:6), and in order of their “hubness” (RI:7; Chakrabarti et al., 1999). In Google, pages that are linked to by pages with high PageRanks themselves have high PageRanks; in Clever, (i) pages that are linked to by pages with high hub weights themselves have high authority weights, and (ii) pages that link to pages with high authority weights themselves have high hub weights. The recursive calculations involved in determining a document’s PageRank, authority weight, or hub weight are more complex than simple row- or column-summation, since each requires the derivation of the eigenvector of an appropriate modification of the initial adjacency matrix. Efficient algorithms for deriving eigenvectors from large N × N matrices are, however, well developed, and their application in this context is described in detail in the papers by Brin and Page (1998) and Kleinberg et al. (1999). In our example, document PageRanks are given by the vector <0.26, 0.13, 0.09, 0.26, 0.26>; document authority weights by <0.44, 0.36, 0.20, 0.00, 0.00>; and document hub weights by <0.00, 0.20, 0.36, 0.44, 0.00>. The resulting rankings are specified by <A | D | E, B, C>; <A, B, C, D | E>; and <D, C, B, A | E>, respectively. All Type I rankings for our small, hypothetical network are summarized in Table 2.
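
The sketch below covers both the degree-based and the eigenvector-based adjacency rankings. The power-iteration PageRank (with the conventional damping factor d = 0.85) and the mutual-reinforcement loop for authority and hub weights are standard simplified formulations, not the exact algorithms of the cited papers, and the parameter choices are our assumptions. Because the link placements in the hypothetical M3 differ from those of Figure 1, the three eigenvector-based rankings will not reproduce the vectors quoted above; the citivity and citedness rankings do.

```python
# Citivity (RI:3) and citedness (RI:4): out- and in-degrees of M3.
print(rank_spec(docs, M3.sum(axis=1) / M3.sum()))   # <D, C, A | B | E>
print(rank_spec(docs, M3.sum(axis=0) / M3.sum()))   # <A, B, C | D | E>

def pagerank(adj, d=0.85, iters=100):
    """Simplified power-iteration PageRank (RI:5)."""
    n = adj.shape[0]
    out = adj.sum(axis=1)
    # Row-stochastic transition matrix; a dangling page (no out-links)
    # is treated as linking uniformly to every page.
    T = np.where(out[:, None] > 0, adj / np.maximum(out, 1.0)[:, None], 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (T.T @ r)   # rank flows along out-links
    return r / r.sum()

def hits(adj, iters=50):
    """Kleinberg-style authority (RI:6) and hub (RI:7) weights."""
    auth = np.ones(adj.shape[0])
    hub = np.ones(adj.shape[0])
    for _ in range(iters):
        auth = adj.T @ hub      # cited by strong hubs -> strong authority
        auth = auth / auth.sum()
        hub = adj @ auth        # cites strong authorities -> strong hub
        hub = hub / hub.sum()
    return auth, hub

A = M3.astype(float)
auth, hub = hits(A)
print(rank_spec(docs, pagerank(A)), rank_spec(docs, auth), rank_spec(docs, hub))
```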

Type II (Personalized) Rankings

In this section, we shall now describe eight methods of producing a Type II ranking, each method utilizing a different source of base data. Two methods begin with the data given in a document--term content matrix; two with the judge--document approval matrix; and four with the document--document adjacency matrix. The core computation in three of the methods is simple row comparison; in three, it is column comparison; the last two involve a more-complex matrix operation that requires the identification of the shortest paths between the nodes in a graph. These distinctions are summarized in Table 3.

One operation used in the production of Type II rankings is row comparison. This involves treating the set of values in each row of a matrix as a separate vector or n-tuple, and evaluating for every row--row pair the degree of similarity between its two members, using a similarity metric such as the cosine coefficient (Ellis, Furner-Hines, & Willett, 1993). Taking as an example the document--term content matrix (M1 in Figure 2), whose rows represent documents, we can evaluate the degree of similarity between every document di and every other document dj by comparing the row vectors of the original base-data matrix. We ultimately arrive at a document--document co-indexing matrix (M1a in Figure 3) of size N × N, each (i,j)th element of which is a value representing the degree of similarity between documents di and dj (Salton, 1963). Similarly, by row comparison we may derive from the judge--document approval matrix (M2), whose rows represent judges, a judge--judge consistency matrix (M2a) of size R × R, each (i,j)th element of which represents the degree of similarity between judges ri and rj (Lesk & Salton, 1968); and we may derive from the document--document adjacency matrix (M3), whose rows represent “citing” documents, a document--document coupledness matrix (M3a) of size N × N, each (i,j)th element of which represents the degree of similarity between citing documents di and dj (Weinberg, 1974).

The second operation is column comparison. This time, it is the set of values in each column of a matrix that is treated as a separate vector, and the degree of similarity between columns that is computed. By column comparison we may derive the following: from the document--term content matrix (M1), whose columns represent terms, a term--term co-occurrence matrix (M1b) of size M × M (Sparck Jones, 1971); from the judge--document approval matrix (M2), whose columns represent documents, a document--document co-approval matrix (M2b) of size N × N (Resnick et al., 1994); and from the document--document adjacency matrix (M3), whose columns represent “cited” documents, a document--document co-citation matrix (M3b) also of size N × N (Salton, 1963; Small, 1973). The Type II rankings derived from these and other operations are described in more detail below.
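
Both operations reduce to pairwise cosine comparison of the rows of a matrix, column comparison being row comparison of the transpose. A minimal sketch, continuing with the hypothetical matrices defined earlier:

```python
def cosine_pairs(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    V = vectors.astype(float)
    norms = np.linalg.norm(V, axis=1)
    norms[norms == 0] = 1.0           # guard against all-zero rows
    V = V / norms[:, None]
    return V @ V.T

# Row comparison:
M1a = cosine_pairs(M1)     # document--document co-indexing (N x N)
M2a = cosine_pairs(M2)     # judge--judge consistency (R x R)
M3a = cosine_pairs(M3)     # document--document coupledness (N x N)

# Column comparison (rows of the transpose):
M1b = cosine_pairs(M1.T)   # term--term co-occurrence (M x M)
M2b = cosine_pairs(M2.T)   # document--document co-approval (N x N)
M3b = cosine_pairs(M3.T)   # document--document co-citation (N x N)
```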

Type II rankings based on content data

Once supplied with a query made up of a set of terms, which may or may not take the form of an existing document, and which may in either case be represented as a vector of values in the same way as any other document, a content-based system may produce Type II rankings, personalized to the query-selector, of three sorts. One such ranking is based on a simple comparison of the query vector with every other document (i.e., row) vector in the document--term content matrix (M1), by which a score indicating degree of similarity is calculated for each query--document pair. Such scores may be recorded as values in a document--document co-indexing matrix (M1a). A linear list of the documents in the collection may then be produced in which documents are ranked in order of their computed co-indexing scores (RII:1). We might say that top-ranked documents in this list are related to the query document in the sense that “terms assigned to the query document were also assigned to” the documents with the highest scores.

This procedure may be viewed as a one-step process, in which a query vector is compared directly with all other document vectors. An alternative ranking (RII:1+) may be produced by a two-step process, in which the original query vector is first expanded or reformulated through comparison of its terms with those in a thesaurus created automatically by manipulation of the data in the term--term co-occurrence matrix (M1b), before comparing the resulting, expanded query with all other document vectors (Van Rijsbergen, 1977). We might say that top-ranked documents in the resulting list are related to the query document in the sense that “terms like the ones assigned to the query document were also assigned to” the documents with the highest scores.

A third kind of content-based ranking (RII:1*) involves a similar, two-step technique for automatic or semi-automatic query reformulation, but relies on the individual searcher’s supply of feedback about the actual relevance of viewed documents (Salton & Buckley, 1990), rather than on the data contained in thesauri (whether manually or automatically constructed). In this case, an initial query vector representing the current searcher’s approval decisions is taken from the judge--document approval matrix (M2). Approval decisions may be obtained either (a) explicitly, by inviting searchers to mark or tag retrieved documents as relevant or otherwise, or similarly to rate them on a non-binary scale, or (b) implicitly, by recording decisions to view, download, print, or purchase as expressions of approval. The contents of the documents rated most highly by the searcher are then analyzed in order to produce a vector that acts as a single representative of that highly-rated set. There are many methods of constructing such a vector; but, in whatever way the modified query is obtained, its comparison with the other document vectors in the document--term content matrix (M1), updating the co-indexing matrix (M1a), proceeds in the basic manner described above. Again, we might say that top-ranked documents in the resulting list are related to the query document(s) in the sense that “terms like the ones assigned to the query document were also assigned to” the documents with the highest scores.
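
One common construction of the representative vector, among the many the text alludes to, is a centroid (Rocchio-style) of the content vectors of the highly rated documents. The sketch below is our illustration rather than the paper's prescription; the choice of judge 1 as the current searcher is arbitrary.

```python
def feedback_query(content, approvals, judge):
    """Centroid of the content vectors of the documents approved by `judge`
    (one simple choice of representative vector for RII:1*)."""
    liked = approvals[judge] > 0               # the judge's row of M2
    if not liked.any():
        return np.zeros(content.shape[1])
    return content[liked].astype(float).mean(axis=0)

def rank_by_query(content, query):
    """Cosine of `query` against every document row of the content matrix."""
    C = content.astype(float)
    norms = np.linalg.norm(C, axis=1) * max(np.linalg.norm(query), 1e-12)
    norms[norms == 0] = 1e-12
    return (C @ query) / norms

q = feedback_query(M1, M2, judge=0)            # judge 1 is row index 0
print(rank_spec(docs, rank_by_query(M1, q)))   # RII:1*-style ranking
```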

Type II rankings based on approval data

A collaboration-based system may similarly produce Type II rankings, personalized to the query-selector, of two sorts. One such ranking is possible once the searcher has selected a query document, which the system can represent as a vector of the approval ratings of that document supplied by different judges. As we saw above, such approval ratings may be gathered either explicitly, by asking the system’s users to indicate the degree to which they approve of given documents, or implicitly, by equating purchasing decisions with positive approval. The one-step ranking of documents is based on a simple comparison of the query vector with every other document (i.e., column) vector in the judge--document approval matrix (M2), by which a score indicating degree of similarity is calculated for each query--document pair. Such scores may be recorded as values in a document--document co-approval matrix (M2b). A linear list of the documents in the collection may then be produced in which documents are ranked in order of their computed similarity scores (RII:2). We might say that top-ranked documents in this list are related to the query document in the sense that “people who approved (liked, bought) the query document also approved” the documents with the highest scores.

An alternative ranking of documents may be produced by a two-step process, in which the searcher is treated as the initial query, and represented by the judgments that they have previously made about documents in the collection. In other words, the current searcher is considered as a judge, and the approval ratings that they have previously supplied are recorded as a row-vector that may be compared with the row-vectors representing other judges in the judge--document approval matrix (M2). The result of this first step of the process is the computation of a row of the judge--judge consistency matrix (M2a), and thus a ranking of judges in order of their similarity to the “query” judge. We might say that top-ranked judges in this list are related to the query judge in the sense that “documents approved by the query judge are also approved by” the judges with the highest scores. The second step involves a comparison of the top-ranked judges’ row-vectors in the original approval matrix (M2) with that of the query judge, in order to identify those documents that, while approved by the top-ranked judges, are not approved by the query judge---perhaps because the query judge has not previously had the chance to record their level of approval of those documents. The product of this second step is a list of such documents, ranked in order of the degree to which they are approved by the judges found (in the first step) to be most similar to the query judge. We might say that top-ranked documents in this list (RII:2*) are related to the ones already approved by the query judge, or that “people like the query judge also approved (liked, bought)” the documents with the highest scores.
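
A compact sketch of the two-step process (RII:2*), reusing cosine_pairs() from above; the neighbourhood size k is our assumption, since the paper leaves such details open.

```python
def recommend(approvals, judge, k=2):
    """Two-step collaborative ranking: score documents by the approvals of
    the k judges most consistent with `judge`, excluding documents the
    query judge has already approved."""
    sims = cosine_pairs(approvals)[judge].copy()   # judge's row of M2a
    sims[judge] = -1.0                             # ignore self-similarity
    neighbours = np.argsort(-sims)[:k]             # k most consistent judges
    scores = approvals[neighbours].sum(axis=0).astype(float)
    scores[approvals[judge] > 0] = 0.0             # drop already-approved docs
    return scores

print(rank_spec(docs, recommend(M2, judge=0)))     # recommendations for judge 1
```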

Type II rankings based on adjacency data

Finally, a context-based system may similarly produce Type II rankings, personalized to the query-selector, of at least six sorts. Rankings of two kinds are produced by a one-step process that is possible once a selection has been made by the searcher of a query document. Any such document may be represented by the system either (a) as a vector of values indicating the existence within the query document of links to other documents, in which case the vector may be compared with the row-vectors representing other “citing” documents in the original adjacency matrix (M3), or (b) as a vector of values indicating the existence within other documents of links to the query document, in which case the vector may be compared with the column-vectors representing other “cited” documents in the original adjacency matrix. In either case, scores indicating degree of similarity may be computed for each document--document pair; in the first case, these scores are recorded as values in a document--document coupledness matrix (M3a); in the second case, the scores are recorded in a document--document co-citation matrix (M3b). Two linear lists of the documents in the collection may then be produced in which documents are ranked in order of their computed similarity scores (RII:3; RII:4). We might say that top-ranked documents in the first list are related to the query document in the sense that “documents cited by the query document are also cited by” the documents with the highest scores; whereas top-ranked documents in the second list are related to the query document in the sense that “documents that cite the query document also cite” the documents with the highest scores.

Just as we may produce alternative content-based and approval-based rankings by the two-step processes described above, we can also produce alternative adjacency-based rankings by considering a combination of adjacency and approval data. If approval decisions have been made by the current searcher in the course of past search sessions, then the documents that have already been highly rated may be used as the starting point for the identification of documents that are highly coupled or co-cited with them. Moreover, if approval decisions have previously been made by other searchers, then the documents that have already been highly rated by those searchers whose approval profiles are most similar to the current searcher’s may be used as the base documents instead. Linear lists of documents may then be produced in which documents are ranked in order of the degree to which they are coupled (RII:3*) or co-cited (RII:4*) with documents that are highly rated, either by a given judge or by judges displaying high levels of consistency with the given judge.

Type II rankings of two further kinds may be derived from adjacency data. The process involves, first, constructing a document--document distance matrix (M3c; Botafogo et al., 1992; Furner et al., 1996) by calculating, for every document--document pair, the length of the shortest path between the members of that pair. Then, given a query document, the documents in the collection may simply be ranked in order of their distance from that query document. In a directed graph, we may distinguish between forward distance (RII:5) and backward distance (RII:6): Forward distance is the length of the shortest path between documents di and dj that traverses only “forward” links; backward distance is the length of the shortest path between documents di and dj that traverses only “backward” links. The forward distance between di and dj equals the backward distance between dj and di. In our example graph, for instance, the forward distance between D and E is 2; the backward distance between D and E is 1.
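
Shortest-path distances can be sketched with a plain breadth-first search over the adjacency matrix; backward distance is simply forward distance over the transpose. Our hypothetical M3 was constructed to agree with the two distances quoted for D and E.

```python
from collections import deque

def distances_from(adj, source):
    """BFS shortest-path distances from `source` along directed links;
    unreachable documents get infinity."""
    n = adj.shape[0]
    dist = [float("inf")] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if adj[i, j] and dist[j] == float("inf"):
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

d = docs.index("D")
forward = distances_from(M3, d)      # RII:5 scores from document D
backward = distances_from(M3.T, d)   # RII:6 scores from document D
print(forward[docs.index("E")], backward[docs.index("E")])   # 2 1
```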

Table 4 summarizes some of the simpler Type II rankings for our small, hypothetical network, given an initial query of document A (for RII:1, 2, 3, 4, 5, 6), term 1 (for use in RII:1+), or judge 1 (for use in RII:1*, 2*, 3*, 4*).

Similar Resources

An Ensemble Click Model for Web Document Ranking

Annually, web search engine providers spend more and more money on documents ranking in search engines result pages (SERP). Click models provide advantageous information for ranking documents in SERPs through modeling interactions among users and search engines. Here, three modules are employed to create a hybrid click model; the first module is a PGM-based click model, the second module in a d...

A New Model for Phrase Search Based on Minimum Weighted Displacement

Finding high-quality web pages is one of the most important tasks of search engines. The relevance between the documents found and the query searched depends on the user observation and increases the complexity of ranking algorithms. The other issue is that users often explore just the first 10 to 20 results while millions of pages related to a query may exist. So search engines have to use sui...

A New Hybrid Method for Web Pages Ranking in Search Engines

There are many algorithms for optimizing the search engine results, ranking takes place according to one or more parameters such as: Backward Links, Forward Links, Content, click through rate and etc. The quality and performance of these algorithms depend on the listed parameters. The ranking is one of the most important components of the search engine that represents the degree of the vitality...

Competitive Supply Chain Network Design Considering Marketing Strategies: A Hybrid Metaheuristic Algorithm

In this paper, a comprehensive model is proposed to design a network for multi-period, multi-echelon, and multi-product inventory controlled the supply chain. Various marketing strategies and guerrilla marketing approaches are considered in the design process under the static competition condition. The goal of the proposed model is to efficiently respond to the customers’ demands in the presenc...

Integrating Spanish Linguistic Resources in a Web Site Assistant

This work describes a proposal to improve web document retrieval by facing the main problems in document searching: first, traditional web search engines miss documents that are relevant to the user query and retrieve many that are not. Second, the query formulation is not as accessible as it could be, and some users have difficulties in expressing boolean queries. To improve the quality of Int...

STARTS: Stanford Proposal for Internet Meta-Searching

Document sources are available everywhere, both within the internal networks of organizations and on the Internet. Even individual organizations use search engines from different vendors to index their internal document collections. These search engines are typically incompatible in that they support different query models and interfaces, they do not return enough information with the query resul...

Publication date: 2002